This report covers stroke prediction. The original data came from the Coursera course “Build and deploy a stroke prediction model using R”.
Stroke is a leading cause of death and disability worldwide, with significant public health implications. Some key facts about stroke:
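The dataset came from Kaggle via Coursera and includes patient information such as gender, age, hypertension, heart disease, marital status, work type, residence type, average glucose level, BMI, smoking status, and whether the patient had a stroke.
The dataset consists of 5110 observations, 249 of which are patients who experienced a stroke. Only 4.87% of the observations involve a stroke, so the outcome classes are strongly imbalanced.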
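The imbalance quoted above can be checked with a few lines of base R. This is a minimal sketch that rebuilds the outcome column from the two counts reported in the text; the vector `stroke_flag` is a stand-in for illustration, not the original data.

```r
# Counts taken from the text: 5110 observations, 249 with a stroke.
n_total  <- 5110
n_stroke <- 249

# Rebuild a 0/1 outcome vector with the same class counts (illustrative stand-in).
stroke_flag <- rep(c(0, 1), times = c(n_total - n_stroke, n_stroke))

# Class shares in percent, rounded to two decimals.
class_share <- round(100 * prop.table(table(stroke_flag)), 2)
print(class_share)
```

With an outcome this rare, plain accuracy can look good for a model that never predicts a stroke, which is worth keeping in mind when a model is evaluated on these data.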
Some general information about the dataset:
dim(stroke)
## [1] 5110 12
head(stroke)
## id gender age hypertension heart_disease ever_married work_type
## 1 9046 Male 67 0 1 Yes Private
## 2 51676 Female 61 0 0 Yes Self-employed
## 3 31112 Male 80 0 1 Yes Private
## 4 60182 Female 49 0 0 Yes Private
## 5 1665 Female 79 1 0 Yes Self-employed
## 6 56669 Male 81 0 0 Yes Private
## Residence_type avg_glucose_level bmi smoking_status stroke
## 1 Urban 228.69 36.60 formerly smoked 1
## 2 Rural 202.21 28.89 never smoked 1
## 3 Rural 105.92 32.50 never smoked 1
## 4 Urban 171.23 34.40 smokes 1
## 5 Rural 174.12 24.00 never smoked 1
## 6 Urban 186.21 29.00 formerly smoked 1
str(stroke)
## 'data.frame': 5110 obs. of 12 variables:
## $ id : int 9046 51676 31112 60182 1665 56669 53882 10434 27419 60491 ...
## $ gender : chr "Male" "Female" "Male" "Female" ...
## $ age : num 67 61 80 49 79 81 74 69 59 78 ...
## $ hypertension : int 0 0 0 0 1 0 1 0 0 0 ...
## $ heart_disease : int 1 0 1 0 0 0 1 0 0 0 ...
## $ ever_married : chr "Yes" "Yes" "Yes" "Yes" ...
## $ work_type : chr "Private" "Self-employed" "Private" "Private" ...
## $ Residence_type : chr "Urban" "Rural" "Rural" "Urban" ...
## $ avg_glucose_level: num 229 202 106 171 174 ...
## $ bmi : num 36.6 28.9 32.5 34.4 24 ...
## $ smoking_status : chr "formerly smoked" "never smoked" "never smoked" "smokes" ...
## $ stroke : int 1 1 1 1 1 1 1 1 1 1 ...
names(stroke)
## [1] "id" "gender" "age"
## [4] "hypertension" "heart_disease" "ever_married"
## [7] "work_type" "Residence_type" "avg_glucose_level"
## [10] "bmi" "smoking_status" "stroke"
summary(stroke)
## id gender age hypertension
## Min. : 67 Length:5110 Min. : 0.08 Min. :0.00000
## 1st Qu.:17741 Class :character 1st Qu.:25.00 1st Qu.:0.00000
## Median :36932 Mode :character Median :45.00 Median :0.00000
## Mean :36518 Mean :43.23 Mean :0.09746
## 3rd Qu.:54682 3rd Qu.:61.00 3rd Qu.:0.00000
## Max. :72940 Max. :82.00 Max. :1.00000
## heart_disease ever_married work_type Residence_type
## Min. :0.00000 Length:5110 Length:5110 Length:5110
## 1st Qu.:0.00000 Class :character Class :character Class :character
## Median :0.00000 Mode :character Mode :character Mode :character
## Mean :0.05401
## 3rd Qu.:0.00000
## Max. :1.00000
## avg_glucose_level bmi smoking_status stroke
## Min. : 55.12 Min. :10.30 Length:5110 Min. :0.00000
## 1st Qu.: 77.25 1st Qu.:23.80 Class :character 1st Qu.:0.00000
## Median : 91.89 Median :28.40 Mode :character Median :0.00000
## Mean :106.15 Mean :28.89 Mean :0.04873
## 3rd Qu.:114.09 3rd Qu.:32.80 3rd Qu.:0.00000
## Max. :271.74 Max. :97.60 Max. :1.00000
skim(stroke)
| Name | stroke |
| Number of rows | 5110 |
| Number of columns | 12 |
| _______________________ | |
| Column type frequency: | |
| character | 5 |
| numeric | 7 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| gender | 0 | 1 | 4 | 6 | 0 | 3 | 0 |
| ever_married | 0 | 1 | 2 | 3 | 0 | 2 | 0 |
| work_type | 0 | 1 | 7 | 13 | 0 | 5 | 0 |
| Residence_type | 0 | 1 | 5 | 5 | 0 | 2 | 0 |
| smoking_status | 0 | 1 | 6 | 15 | 0 | 4 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 36517.83 | 21161.72 | 67.00 | 17741.25 | 36932.00 | 54682.00 | 72940.00 | ▇▇▇▇▇ |
| age | 0 | 1 | 43.23 | 22.61 | 0.08 | 25.00 | 45.00 | 61.00 | 82.00 | ▅▆▇▇▆ |
| hypertension | 0 | 1 | 0.10 | 0.30 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| heart_disease | 0 | 1 | 0.05 | 0.23 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| avg_glucose_level | 0 | 1 | 106.15 | 45.28 | 55.12 | 77.24 | 91.88 | 114.09 | 271.74 | ▇▃▁▁▁ |
| bmi | 0 | 1 | 28.89 | 7.70 | 10.30 | 23.80 | 28.40 | 32.80 | 97.60 | ▇▇▁▁▁ |
| stroke | 0 | 1 | 0.05 | 0.22 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
The average age in the study group is 43.23 years, and the average glucose level is 106.15. One person in the study group reported their gender as “Other”. This record was removed to make the data visualizations more readable, leaving 5,109 observations.
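Dropping the single “Other” record can be done with a logical filter. The sketch below uses a small hypothetical data frame (`toy`) with the same `gender` column name as the dataset; it is not the author's original code.

```r
# Hypothetical toy data with the same column name as the stroke dataset.
toy <- data.frame(
  id     = 1:4,
  gender = c("Male", "Female", "Other", "Female"),
  stringsAsFactors = FALSE
)

# Keep every row whose gender is not "Other".
toy_clean <- toy[toy$gender != "Other", ]

nrow(toy_clean)
```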
Missing values were found in the “bmi” column. They were replaced with the mean of all observed BMI measurements, rounded to two decimal places.
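Mean imputation of this kind can be sketched as follows. The example assumes the missing BMI values are already coded as `NA` in a numeric column; the data frame `toy` is hypothetical.

```r
# Hypothetical toy data; NA marks a missing BMI measurement.
toy <- data.frame(bmi = c(36.6, NA, 32.5, 34.4, NA, 28.9))

# Mean of the observed values only, rounded to two decimal places.
bmi_mean <- round(mean(toy$bmi, na.rm = TRUE), 2)

# Fill the gaps with that mean.
toy$bmi[is.na(toy$bmi)] <- bmi_mean

toy$bmi
```

Mean imputation keeps the column mean unchanged but shrinks its variance, so it is a simple choice rather than a neutral one.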
The charts below show the distribution of demographic variables: gender, residence type and work type.
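A bar chart of this kind can be drawn with ggplot2. The sketch below is an assumption about how such a chart might be built, not the author's plotting code; it uses a hypothetical data frame `toy` and the `after_stat()` helper, which replaces the older `..count..` notation in current ggplot2 versions.

```r
library(ggplot2)

# Hypothetical toy data standing in for the cleaned stroke data frame.
toy <- data.frame(
  gender = c("Male", "Female", "Female", "Male", "Female", "Female")
)

# Bar heights are the share of patients in each gender group.
p <- ggplot(toy, aes(x = gender)) +
  geom_bar(aes(y = after_stat(count / sum(count))), fill = "steelblue") +
  labs(x = "Gender", y = "Share of patients")

# print(p) would render the chart in an interactive session.
```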
The charts below show the distribution of health-related variables: hypertension, heart disease and smoking status.
BMI values were grouped according to the “BMI Classification Percentile And Cut Off Points” classification. The chart below shows how the resulting BMI categories are distributed in the study group.
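Grouping BMI values into categories can be sketched with `cut()`. The break points below are the commonly used adult cut-offs (18.5, 25, 30) and are an assumption here, since the report does not list the exact boundaries or labels it applied.

```r
# Example BMI values (illustrative only).
bmi <- c(17.2, 22.0, 27.5, 31.0, 25.0)

# Assumed adult cut-offs; the report does not state its exact break points.
bmi_class <- cut(
  bmi,
  breaks = c(-Inf, 18.5, 25, 30, Inf),
  labels = c("Underweight", "Normal weight", "Overweight", "Obesity"),
  right  = FALSE  # intervals are [lower, upper), so 25.0 counts as overweight
)

table(bmi_class)
```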
As the chart shows, a large share of the patients are overweight. This matches worrying reports from other studies on overweight. According to the WHO fact sheet (https://www.who.int/news-room/fact-sheets/detail/obesity-and-overweight):
“In 2022, 2.5 billion adults (18 years and older) were overweight. Of these, 890 million were living with obesity. In 2022, 43% of adults aged 18 years and over were overweight and 16% were living with obesity.”
The study group includes 1,610 overweight patients, which is about 31.5% of the entire group (1,610 of 5,109).
The chart below shows the distribution of patients with and without stroke depending on their smoking status.
| | Patients without stroke | Patients with stroke |
|---|---|---|
| formerly smoked | 814 | 70 |
| never smoked | 1802 | 90 |
| smokes | 747 | 42 |
| Unknown | 1497 | 47 |
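A cross-tabulation like the one above can be produced with base R's `table()`. This is a minimal sketch on a hypothetical data frame `toy` that reuses the dataset's column names; the row and column labels mirror the table in the text.

```r
# Hypothetical toy data with the dataset's column names.
toy <- data.frame(
  smoking_status = c("smokes", "never smoked", "never smoked",
                     "Unknown", "formerly smoked", "smokes"),
  stroke         = c(0, 0, 1, 0, 1, 0)
)

# Label the 0/1 outcome so the column headers are readable.
outcome <- factor(
  toy$stroke,
  levels = c(0, 1),
  labels = c("Patients without stroke", "Patients with stroke")
)

# Rows: smoking status; columns: stroke outcome.
tab <- table(toy$smoking_status, outcome)
print(tab)
```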
The chart below shows the distribution of patients with and without hypertension depending on their smoking status.
| | No hypertension | Hypertension |
|---|---|---|
| formerly smoked | 765 | 120 |
| never smoked | 1660 | 232 |
| smokes | 695 | 94 |
| Unknown | 1492 | 52 |
The chart below shows the distribution of patients with and without stroke depending on the occurrence of hypertension.
| | No Hypertension | Hypertension |
|---|---|---|
| No Stroke | 4429 | 432 |
| Stroke | 183 | 66 |
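The counts in this table can be turned into stroke rates per hypertension group, which makes the two groups easier to compare. The sketch below enters the four counts from the table by hand and uses `prop.table()` to compute column-wise proportions.

```r
# Counts copied from the table above (filled column by column).
counts <- matrix(
  c(4429, 183,   # No Hypertension: no stroke, stroke
    432,  66),   # Hypertension:    no stroke, stroke
  nrow = 2,
  dimnames = list(
    c("No Stroke", "Stroke"),
    c("No Hypertension", "Hypertension")
  )
)

# Share of stroke patients within each hypertension group, in percent.
stroke_rate <- 100 * prop.table(counts, margin = 2)["Stroke", ]
round(stroke_rate, 2)
```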
A relationship between age and glucose levels has been observed. The chart below shows this relationship, including stroke patients and non-stroke patients alike.
The chart shows that high glucose levels are more common in older patients. It also distinguishes patients with a diagnosed stroke (orange dots) from those without a stroke (green dots). Strokes become more frequent with increasing age and with increasing glucose levels. The correlations, although weak, are statistically significant and point in the same direction:

- age and glucose: r=0.24, p<0.001;
- age and stroke: r=0.25, p<0.001;
- stroke and glucose: r=0.12, p<0.001.
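Correlation coefficients and p-values of this kind can be obtained with `cor.test()`. The sketch below assumes a Pearson correlation, since the report does not name the method, and runs on simulated values rather than the stroke data, so its numbers will not match the ones reported above.

```r
set.seed(42)

# Simulated stand-ins for age and average glucose level (not the real data).
n       <- 200
age     <- runif(n, min = 20, max = 80)
glucose <- 80 + 0.5 * age + rnorm(n, sd = 30)

# Pearson correlation with a two-sided significance test (assumed method).
res <- cor.test(age, glucose, method = "pearson")

r_value <- unname(res$estimate)  # correlation coefficient
p_value <- res$p.value           # p-value of the test
```

When one of the variables is binary, as with `stroke`, the Pearson coefficient is the point-biserial correlation.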
Average blood glucose levels also differed between the sexes. The two graphs below illustrate this, separately for stroke and non-stroke patients.
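Group means like those behind the graphs can be computed with `aggregate()`. This is a minimal sketch on a hypothetical data frame `toy` that reuses the dataset's column names; the values are made up for illustration.

```r
# Hypothetical toy data with the dataset's column names.
toy <- data.frame(
  gender            = c("Male", "Male", "Female", "Female", "Male", "Female"),
  stroke            = c(0, 1, 0, 1, 0, 0),
  avg_glucose_level = c(100, 200, 90, 180, 120, 110)
)

# Mean glucose level for every combination of gender and stroke status.
means <- aggregate(avg_glucose_level ~ gender + stroke, data = toy, FUN = mean)
print(means)
```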
# Conclusions
# Future studies
Future research should take into account more factors related to health behaviors, such as eating habits, physical activity, and alcohol consumption. Clinical indicators such as waist-to-height ratio, visceral adipose tissue, and the triglyceride-glucose index should also be considered, as should the use of blood-thinning medications.